[Bug] Fix skip() API which may skip unexpected bytes and cause inconsistent data #40

frankliee merged 3 commits into apache:master
Conversation
Could you add a UT for this case?
Codecov Report

Coverage diff (master vs. #40):

|            | master | #40    | +/-    |
|------------|--------|--------|--------|
| Coverage   | 56.81% | 56.83% | +0.01% |
| Complexity | 1203   | 1207   | +4     |
| Files      | 152    | 152    |        |
| Lines      | 8401   | 8437   | +36    |
| Branches   | 813    | 819    | +6     |
| Hits       | 4773   | 4795   | +22    |
| Misses     | 3369   | 3380   | +11    |
| Partials   | 259    | 262    | +3     |

Continue to review the full report at Codecov.
The fix targets a potential problem with this API, and the issue can't be reproduced.

Actually, I'm not quite sure whether skip() causes the problem, but with this fix we can check it via the new log.
Does the Hdfs reader need the same fix?
@@ -41,7 +41,22 @@ public LocalFileReader(String path) throws Exception {

  public byte[] read(long offset, int length) {
    try {
`at` -> `targetSkip` or `maxSkip`?
Hdfs has its own API.
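The review comments above discuss how the reader should handle a partial skip. As a rough illustration of that defensive pattern (a minimal sketch, not the actual patch; the method and variable names such as `skipFully` and `targetSkip` are made up for the example), the reader can loop until the requested number of bytes has actually been skipped and fail loudly otherwise:

```java
import java.io.IOException;
import java.io.InputStream;

// Sketch only: InputStream.skip() may legally skip fewer bytes than requested,
// so keep calling it until the target offset is reached or no progress is made.
final class SkipUtil {
  static void skipFully(InputStream in, long targetSkip) throws IOException {
    long remaining = targetSkip;
    while (remaining > 0) {
      long skipped = in.skip(remaining);
      if (skipped <= 0) {
        // No progress: surface the problem instead of reading from the wrong
        // offset, which would only show up later as a CRC mismatch on the block.
        throw new IOException("Expected to skip " + targetSkip
            + " bytes but " + remaining + " bytes remain unskipped");
      }
      remaining -= skipped;
    }
  }
}
```

With a helper like this, a reader such as `LocalFileReader.read` could verify that the stream really is at the expected offset before reading the block, so a short skip surfaces in the log instead of as downstream data corruption.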
LGTM. I tried to reproduce this problem, but I failed.
@colinmjj Do we need to backport this patch to branch 0.5? |
sure, I'll create a new PR for the backport |
[Bug] Fix skip() api maybe skip unexpected bytes which makes inconsistent data (apache#40)

### What changes were proposed in this pull request?
Fix a bug when calling `inputstream.skip()`, which may return an unexpected result.

### Why are the changes needed?
We got exception messages like the following, and they may be caused by unexpected data from `Local` storage:
```
com.tencent.rss.common.exception.RssException: Unexpected crc value for blockId[9992363390829154], expected:2562548848, actual:2244862586
	at com.tencent.rss.client.impl.ShuffleReadClientImpl.readShuffleBlockData(ShuffleReadClientImpl.java:184)
	at org.apache.spark.shuffle.reader.RssShuffleDataIterator.hasNext(RssShuffleDataIterator.java:99)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
With current UTs

[Bug] Fix skip() api maybe skip unexpected bytes which makes inconsistent data (#40) (#52)
Yes, this problem still exists after this patch, and there is no exception log on the server side. @jerqi @colinmjj
…1632)

### What changes were proposed in this pull request?
In our cluster, deleting the pod is denied by a webhook even though all applications have been deleted for a long time. When I curl http://host:ip/metrics/server, I find app_num_with_node is 1. The problem is that some applications are leaked. I see many duplicated logs: `[INFO] ShuffleTaskManager.checkResourceStatus - Detect expired appId[appattempt_xxx_xx_xx] according to rss.server.app.expired.withoutHeartbeat`. When I jstack the server many times, clearResourceThread is stuck forever; here is the call stack:
```
"clearResourceThread" #40 daemon prio=5 os_prio=0 cpu=3767.63ms elapsed=5393.50s tid=0x00007f24fe92e800 nid=0x8f waiting on condition [0x00007f24f7b33000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@11.0.22/Native Method)
	- parking to wait for <0x00007f28d5e29f20> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
	at java.util.concurrent.locks.LockSupport.park(java.base@11.0.22/LockSupport.java:194)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.22/AbstractQueuedSynchronizer.java:885)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(java.base@11.0.22/AbstractQueuedSynchronizer.java:917)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@11.0.22/AbstractQueuedSynchronizer.java:1240)
	at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(java.base@11.0.22/ReentrantReadWriteLock.java:959)
	at org.apache.uniffle.server.ShuffleTaskManager.removeResources(ShuffleTaskManager.java:756)
	at org.apache.uniffle.server.ShuffleTaskManager.lambda$new$0(ShuffleTaskManager.java:183)
	at org.apache.uniffle.server.ShuffleTaskManager$$Lambda$216/0x00007f24f824cc40.run(Unknown Source)
	at java.lang.Thread.run(java.base@11.0.22/Thread.java:829)
```
Apparently there is a lock that is not being released. Looking at the code, it is easy to see that the read lock in flushBuffer is not released correctly. The log `ShuffleBufferManager.flushBuffer - Shuffle[3066071] for app[appattempt_xxx] has already been removed, no need to flush the buffer` proves it.

Fix: #1631

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
No test; obvious mistake.
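For readers following that referenced fix: the failure mode is a ReentrantReadWriteLock read lock that is acquired but never released on an early-return path, so a later write-lock acquisition blocks forever. A minimal sketch of the standard try/finally pattern that avoids this (class, method, and field names here are illustrative, not the real ShuffleBufferManager code):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of the lock-release pattern: every path out of the method, including
// the early return, must unlock, otherwise writeLock().lock() blocks forever.
final class BufferFlushExample {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  void flushBuffer(boolean shuffleAlreadyRemoved) {
    lock.readLock().lock();
    try {
      if (shuffleAlreadyRemoved) {
        // Early return is safe because the finally block releases the lock.
        return;
      }
      // ... flush the buffer while holding the read lock ...
    } finally {
      lock.readLock().unlock();
    }
  }
}
```

Putting the unlock in a finally block guarantees that even the "already removed, no need to flush" exit releases the read lock, so the cleanup thread's write lock can still be acquired.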
### What changes were proposed in this pull request?
Fix a bug when calling `inputstream.skip()`, which may return an unexpected result.

### Why are the changes needed?
We got exception messages like the stack trace shown above, and they may be caused by unexpected data from `Local` storage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
With current UTs
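As a concrete illustration of the failure mode described above: a single `InputStream.skip()` call is allowed to skip fewer bytes than requested. This tiny standalone example (not part of the PR) shows it with a ByteArrayInputStream:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Demonstrates the InputStream.skip() contract: the return value must be
// checked, because a single call may skip fewer bytes than requested.
public final class SkipContractDemo {
  public static void main(String[] args) throws IOException {
    InputStream in = new ByteArrayInputStream(new byte[10]);
    long skipped = in.skip(100);  // ask to skip past the end of the stream
    // Prints: requested=100, actually skipped=10
    System.out.println("requested=100, actually skipped=" + skipped);
  }
}
```

Here skip(100) returns 10, so any reader that trusts one skip() call to land on the requested offset can end up reading the wrong bytes, which is exactly the kind of silent corruption that later surfaces as a CRC mismatch.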